if (!require("pacman")) install.packages("pacman")
Loading required package: pacman
# use this line for installing/loading# pacman::p_load() #devtools::install_github("tidyverse/dsbox")install.packages("readr", repos ="http://cran.us.r-project.org")
The downloaded binary packages are in
/var/folders/_d/hvsdqqnd3jddpd6y2p2zdpt40000gn/T//Rtmp7rNUzM/downloaded_packages
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("ggplot2")
1 - Road traffic accidents in Edinburgh
accidents <-read_csv("data/accidents.csv")
Rows: 768 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): id, severity, date, day_of_week, highway, first_road_class, road_...
dbl (12): easting, northing, longitude, latitude, police_force, vehicles, c...
time (1): time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
accidents2 <- accidents[c("day_of_week", "time", "severity")]accidents2 <- accidents2 %>%group_by(day_of_week) %>%mutate(is_weekend =if(any(day_of_week=='Sunday'| day_of_week =='Saturday')) "Weekend"else"Weekday")g <-ggplot(accidents2, aes(x=time, group=severity, fill=severity))g <- g +geom_density(alpha=0.7) +facet_wrap(~is_weekend, nrow=2)g <- g +labs(x="Time of day", y="Density", title="Number of accidents throughout the day", subtitle="By day of week and severity") +scale_fill_manual("severity", name="Severity", values =c("Fatal"="#A694AE","Serious"="#A7C9C7", "Slight"="#FCF3A9"))g <- g +theme_minimal() g
This data documents fatalities by weekday, time of day, and the severity. This could be in an attempt to identify room for improvement, or road planning, or an initial discovery for root cause identification and further fatality prevention. We also notice that fatal severity occurs mostly on the weekends, and not during the week; there are 3 major spikes - around 10pm, early morning, and around noon.
2 - NYC marathon winners
2a
We see that with the two plots below, there is a better understanding of outliers when viewing the box plot. However, in the histogram, there appear to be two distributions: a high number of runners around the 2 hour mark, and another high number of runners around the 2:30 mark. This is not seen in the boxplot, which only shows the mean of the entire data set.
nyc <-read_csv("data/nyc_marathon.csv")
Rows: 102 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): name, country, division, note
dbl (2): year, time_hrs
time (1): time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# glimpse(nyc)g <-ggplot(nyc, aes(x=time)) +geom_histogram(alpha=0.9, fill="lightblue") +theme_minimal()g <- g +labs(x="Total Time", y="Frequency", title ="NYC Marathon winner times, 1970-2000")g
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_bin()`).
g <-ggplot(nyc, aes(x=time)) +geom_boxplot(alpha=0.9, fill="lightblue") +theme_minimal()g <- g +labs(x="Total Time", title ="NYC Marathon winner times, 1970-2000")g
Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_boxplot()`).
2b
When stratifying by gender, we find that the reason for two distributions as seen from the histogram above is due to difference in performance by gender. Women have a higher winning time than men.
g <-ggplot(nyc, aes(y=time, color=division)) +geom_boxplot() +scale_color_manual(values=c("#E69F00", "#7aa9a9" ))g <- g +theme_minimal() +theme(axis.ticks.x=element_blank(), axis.text.x=element_blank()) +labs(y="Time", title="NYC Marathon winners 1970-2000 comparison via Men and Women")g
Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_boxplot()`).
2c
I don’t believe that any of the data from 2b is redundant. Comparing both men and women via box plots shows us useful information, and maintains the minimal data to ink ratio that we desire. We could consider eliminating outliers, doing so would focus on the core values instead of any high or low performing years.
g <-ggplot(nyc, aes(y=time, color=division)) +geom_boxplot(outlier.shape=NA) +scale_color_manual(values=c("#E69F00", "#7aa9a9" ))g <- g +theme_minimal() +theme(axis.ticks.x=element_blank(), axis.text.x=element_blank()) +labs(y="Time", title="NYC Marathon winners 1970-2000 comparison via Men and Women")g
Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_boxplot()`).
2d
With a time-series based model twe can more clearly see the difference across years. What we notice that the spikes between men and women are consistent across years. This could be due to the chnage in the course every year, causing fluctuation in times.
g <-ggplot(nyc, aes(x=year, y=time, color=division)) +geom_line() +scale_color_manual(values=c("#E69F00", "#7aa9a9" ))g <- g +theme_minimal() +labs(x="Year", y="Finishing Time", title="NYC Marathon winners 1970-2000 comparison via Men and Women")g
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
3 - US counties
3a
This does not make sense. It compares two different things, a smoking ban and education against household income. As another note, the y axis is subtly two different values, but of the same type (population and household income).
county <-read_csv("data/county.csv")
Rows: 3142 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): name, state, metro, median_edu, smoking_ban
dbl (10): pop2000, pop2010, pop2017, pop_change, poverty, homeownership, mul...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(county) +geom_point(aes(x = median_edu, y = median_hh_income)) +geom_boxplot(aes(x = smoking_ban, y = pop2017))
Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
3b
The second plot makes it easier to compare poverty levels. Primarily this is due to the ratio of the x and y axes. The y axis on the first plot is too challenging to compare and distinguish between poverty levels. The second graph solves this question and makes it easier to review poverty levels. In addition, the standardization of poverty levels on the second set of graphs is easier to read because the comparison against poverty is easier to compare between graphs: i.e., when knowing what we want to compare (poverty), it is easier to compare when you can traverse across faceted graphs easily.
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
4 - Rental apartments in SF
4a
Based on the graphs below, there is a positive correlation between your income and your credit card balance. In respect to the difference of being in a relationship, there is still a positive correlation between the two, however with those being married, both income and credit card balance are higher. The same applies for being a student, however this could be an outlier as there are not many student observations.
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
credit <-read_csv("data/credit.csv")
Rows: 400 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): student, married
dbl (3): balance, income, limit
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I think that both might be a useful feature when predicting credit card balance. They both have a positive correlation, and the correlation is consistent. If the correlation were inconsistent then perhaps I would suggest to not include it in predictions. In addition, from domain knowledge it does seem reasonable that a married person would have a higher credit card balance.
The relationship of credit utilization is different for students. When comparing credit utilization versus balance, students have a positive correlation for credit card balance, but a negative one if they are not married, and a marginally negative one for those that are married. For non-students, we still find a positive correlation of credit utilization to income.
5 - Napoleon’s march.
knitr::opts_chunk$set(fig.width =7, # 7" widthfig.asp = .4, # the golden ratiofig.retina =3, # dpi multiplier for displaying HTML output on retinafig.align ="center", # center align figuresdpi =300# higher dpi, sharper image)# load data set nap <-read_rds("data/napoleon.rds")# I created subsets fo data here for ease of use when graphing.# I did so for me to use geom_path and identify coloring and direct lines, instead of using someing like geom_line(), which would not perform in a similar manner. Subsets of data are stratified by group trooping and advancing or retreating.troops <- nap$troopsadvancing <- troops |>filter(direction =="advancing"& group==1)retreating <- troops |>filter(direction =="retreating"& group ==1)h <-ggplot(advancing, aes(x=long, y=lat, size=survivors), colors=direction)adv2 <- troops |>filter(direction =="advancing"& group==2)rtr2 <- troops |>filter(direction =="retreating"& group ==2)adv3 <- troops |>filter(direction =="advancing"& group==3)rtr3 <- troops |>filter(direction =="retreating"& group ==3)alpha_setting <-1# I tried different alpha settings but did not like the results. # I created a temporary graph called h and p. I used this for testing in the console for ease of use when trying out different graph additions, and have kept it.# Things done here# - Add a scale of the geom_path for the width of each band# - Add paths for each group and each type (retreating, advancing) p<- h +scale_size(range=c(2, 10)) +geom_path(data=retreating, aes(long, lat, size=survivors, color=direction), linejoin ="bevel", lineend="round", alpha=alpha_setting) +geom_path(data=rtr2, aes(long, lat, size=survivors, color=direction), linejoin ="bevel", lineend="round", alpha=alpha_setting) +geom_path(data=rtr3, aes(long, lat, size=survivors, color=direction), linejoin ="bevel", lineend="round", alpha=alpha_setting) +geom_path(data=adv3, aes(long, lat, size=survivors, color=direction), linejoin ="bevel", lineend="round", alpha=alpha_setting) +geom_path(data=adv2, aes(long, lat, size=survivors, color=direction), linejoin ="bevel", lineend="round", alpha=alpha_setting) +geom_path(aes(color=direction), linejoin ="bevel", lineend="round", alpha=alpha_setting)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
# Here I consolidate the p graph with the final theme. # - Get rid of the background, legend, etc.# - Add labeling of the cities# - Add custom coloring of the paths# - Get rid of the y scaling and x scaling, as that is not used in the original for the path itself# - Remove x and y axis labelsminard_plot <- p +theme(panel.background=element_rect(fill="white", colour="white", linewidth =1), panel.grid.major.x=element_blank(),panel.grid.minor.x=element_blank(),,legend.position="none", ) +geom_text(data=nap$cities, aes(label=city, size=10, y=lat+.1)) +scale_colour_manual(values=c("#3CA76E", "#91383C"), aes(alpha=0.5)) +scale_y_continuous(breaks=NULL) +scale_x_continuous(breaks=NULL) +labs(x="", y="")
# In this section I create the temperatures graph. # Add a column of the formatted dates, similar to what is used in the original graph# - Also change the y scaling to what is used in the original graph# - Get rid of y labels# - Add the customized date formatting# - Add the other text of the temperature itself next to the datetemps <- nap$temperaturestemps['formatted_dates'] <-format(temps$date, "%d of %b")temp_plot <-ggplot(temps, aes(long, temp)) +geom_line() +theme(panel.background=element_rect(fill="white", linewidth =1), panel.grid=element_line(color ="lightgrey"),panel.grid.major.x=element_blank(),panel.grid.minor.x=element_blank(),) +scale_y_continuous(breaks=seq(0, -30, by=-10), position="right") +labs(y="") +geom_text(aes(label=formatted_dates, x = long + .4, y=temp -2), size=2) +geom_point() +geom_text(aes(label=temp, x = long -0.25, y=temp -2), size=2)
# Arrange both the Napoleon's path and the temperatures together.grid.arrange(minard_plot, temp_plot, nrow=2)
References
I used the following :
https://stackoverflow.com/questions/44869074/using-ggplot2-how-to-set-the-tick-marks-intervals-on-y-axis-without-distorting for scalling identification
Some of this: https://github.com/andrewheiss/fancy-minard
In addition, time spend via R Studio’s Help functionality (built in documentation)